POLI 572B

Michael Weaver

March 1, 2024

Least Squares Continued

Objectives

Recap

  • Causal Estimands \(\to\) CEF
  • Mean as least-distance prediction

Objectives

Bivariate Regression

Matrix Algebra

  • Key mathematical operations (quick review)
  • Geometric intuition: vectors, matrices, projections
  • Derive the mean

Deriving Least Squares

  • Linear algebra derivation of mean
  • Linear algebra derivation of bivariate regression
  • Key insights!

Multivariate Least Squares

  • Mathematical requirements
  • Linear independence
  • Intuition for what it does
  • What does “controlling” mean?

Recap

Causal Estimands

Seen several approaches to estimating causal effects

  • all causal estimands we have seen are average effects: ACE, ATT, etc.

  • all estimands involve comparing average outcomes of \(Y\) across different values of \(D\) or \(Z\)

  • conditioning involves plugging in average outcome of \(Y\) at values of \(D,X\).

Conditional Expectation Function

the conditional expectation function (Angrist and Pischke)

expectation: because it is about the mean: \(E[Y]\)

conditional: because it is conditional on values of \(X\): \(E[Y | X]\)

function: because \(E[Y | X] = f(X)\), there is some mathematical mapping between values of \(X\) and \(E[Y]\).

\[E[Y | X = x] = f(x)\]

Conditional Expectation Function

The difficulty is: how do we find the function?

  • a function takes some value of \(X\) and uniquely maps it to some value of \(E[Y]\)
  • we need to “learn” this function from the data
    • how do we learn without “overfitting” the data?
  • depending on how much data we have, we have to make some choice of how to interpolate/extrapolate when “learning” this function

Conditional Expectation Function

  • one easy choice is to approximate the CEF as linear.
  • That is to say, \(E[Y | X]\) is linear in \(X\).
  • The function takes the form of an equation of a line.
  • This leads us to bivariate regression

Bivariate Regression

Beyond the Mean

We saw that the mean is a way of choosing some \(\hat{y}\) as a prediction of the values of \(y\), where \(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) are the prediction errors or residuals.

  • we choose \(\hat{y}\) such that the prediction vector has the closest possible distance to \(\mathbf{y}\) (the length of \(\mathbf{e}\) is minimized when \(\mathbf{e}\) is orthogonal to \(\hat{y}\cdot\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\))

Bivariate regression chooses \(\hat{y}\) as closest possible prediction of \(y\) with the form

\[\mathbf{\hat{y}} = b_0 + b_1\cdot \mathbf{x}\]

An intercept \(b_0\) and coefficient \(b_1\) multiplied by \(\mathbf{x}\)

Which line?

Graph of Averages

Which line?

The red line above is the prediction using least squares.

It closely approximates the conditional mean of son’s height (\(y\)) across values of father’s height (\(x\)).

How do we obtain this line mathematically? (proof/derivation here)

Bivariate Regression

The slope:

\[b_1 = \frac{Cov(x,y)}{Var(x)}\]

  • Expresses how much the mean of \(y\) changes for a 1-unit change in \(x\)
  • When expressed as a function of the correlation coefficient \(r\), we see it is \(r\) times the rise (\(SD_y\)) over the run (\(SD_x\))

Covariance (remember)

\(Cov(X,Y) = \frac{1}{n}\sum\limits_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\)

\(Cov(X,Y) = \overline{xy} - \bar{x}\bar{y}\)
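To see that the two expressions agree, take a small worked example (illustrative numbers): \(x = (0, 2, 4)\), \(y = (0, 4, 8)\), so \(\bar{x} = 2\) and \(\bar{y} = 4\).

\[\frac{1}{3}\sum\limits_{i=1}^3(x_i - \bar{x})(y_i - \bar{y}) = \frac{(-2)(-4) + (0)(0) + (2)(4)}{3} = \frac{16}{3}\]

\[\overline{xy} - \bar{x}\bar{y} = \frac{0 + 8 + 32}{3} - 2\cdot 4 = \frac{40}{3} - \frac{24}{3} = \frac{16}{3}\]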

Pearson Correlation

\(r(x,y) = \frac{Cov(x,y)}{SD(x)SD(y)}\)

\(b_1 = r\frac{SD(y)}{SD(x)} = \frac{Cov(x,y)}{SD(x)SD(y)}\cdot\frac{SD(y)}{SD(x)} = \frac{Cov(x,y)}{Var(x)}\)

Bivariate Regression

The Intercept:

\[b_0 = \overline{y} - \overline{x}\cdot b_1\]

Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.
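As a quick worked example (illustrative numbers): with \(x = (0, 2, 4)\) and \(y = (0, 4, 8)\), we have \(Cov(x,y) = 16/3\) and \(Var(x) = 8/3\), so

\[b_1 = \frac{16/3}{8/3} = 2; \qquad b_0 = \bar{y} - \bar{x}\cdot b_1 = 4 - 2\cdot 2 = 0\]

and the line \(\hat{y} = 0 + 2x\) indeed passes through the point of averages \((2, 4)\).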

Practice:

  • write a function to calculate slope and intercept
  • write a function to calculate y-hat, taking intercept, slope, values of x

Limits

Why are these the equations?

What do these equations have to do with the mean?

What do these equations tell us about “controlling” for variables?

  • Need linear algebra to understand.

Linear Algebra

Quick Review

We assume you have watched this series, Chapters 1-9.

Ask for clarification where required.

What is a vector?

  • a vector is an \(n \times 1\) or \(1 \times n\) array of numbers.
  • these values together represent a point that lives in \(n\) dimensional space
  • each number corresponds to a dimension in that space:
    • e.g. can think of a vector as going \(x\) units along X axis and \(y\) units along Y axis and drawing arrow from origin \(\begin{pmatrix}0 \\ 0\end{pmatrix}\) to that point

\[v = \begin{pmatrix}3 \\ 5 \end{pmatrix} = \begin{pmatrix}x \\ y \end{pmatrix}\]

What is a vector?

Vectors can be added:

vectors of the same dimensions can be added element-by-element.

For example:

\[\begin{pmatrix}1 \\ 1 \end{pmatrix} + \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix}-1 \\ 4 \end{pmatrix}\] Equivalent to putting the second vector’s tail at the tip of the first vector, then following it to the end.

Vectors Can be Added

Vectors can be Scaled

Vectors can be multiplied by a number, element-by-element.

\[2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}\] Equivalent to stretching out this vector by factor of \(2.5\)

\[0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\] Equivalent to squishing this vector by factor of \(0.5\)

\(a = 2.5 \cdot \begin{pmatrix}1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix}; \ b = 0.5 \cdot \begin{pmatrix}-2 \\ 3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1.5 \end{pmatrix}\)

Vectors have a span

The span of a vector is the set of points which it could reach by scaling it up/down by some factor.

For the vector \(\begin{pmatrix}1 \\ 1 \end{pmatrix}\), the span is the straight line stretching from \(\begin{pmatrix}-\infty \\ -\infty \end{pmatrix}\) to \(\begin{pmatrix}\infty \\ \infty \end{pmatrix}\)

The span of any vector always goes through the origin: Why?

Basis Vectors

We can think of any vector as being decomposed into movement along each of the basis vectors - unit-length (length \(1\)) vectors along each of the dimensions of the space (e.g. \(x\), \(y\), \(z\), etc.).

Basis Vectors

For instance, \(\begin{pmatrix}3 \\ 4 \\ 5 \end{pmatrix}\) can be achieved by adding up:

  • \(3 \times \begin{pmatrix}1 \\ 0 \\ 0 \end{pmatrix}\) (basis vector in \(x\) axis)
  • \(4 \times \begin{pmatrix}0 \\ 1 \\ 0 \end{pmatrix}\) (basis vector in \(y\) axis)
  • \(5 \times \begin{pmatrix}0 \\ 0 \\ 1 \end{pmatrix}\) (basis vector in \(z\) axis)

Note this kind of addition is only possible because \(x,y,z\) are perpendicular or orthogonal to each other

We can change the space a vector lives in by giving it new basis vectors.

Matrices

Matrices are 2-dimensional arrays of numbers (\(m \times n\)): essentially \(n\) vectors of dimension \(m \times 1\) stuck side by side.

  • better understood as a linear transformation: a set of instructions that transforms a vector by setting new locations for the basis vectors
  • each column in the matrix indicates the new position of the basis vector (in terms of the current coordinate space)
  • if the matrix is not square, it transforms between spaces with different dimensions

Matrices can be Multiplied

Matrices can be multiplied if their inner dimensions match:

  • \(m \times n\) matrix \(A\) can be multiplied by matrix \(B\) if \(B\) is \(n \times p\).
  • Matrix \(AB\) is \(m \times p\)
  • Rows of \(A\) multiplied with columns of \(B\) and then summed
  • Multiplication is not commutative: \(BA\) is not even defined unless \(p = m\)
  • Even when both are defined, in general \(AB \neq BA\); note that \(B'A' = (AB)'\)

Matrices can be Multiplied

\[\begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & -1 \\ 1 & 2 \end{pmatrix} =\]

\[\begin{pmatrix} (1 \cdot 1)+(1\cdot1) & (1\cdot-1) + (1 \cdot 2) \\ (-1\cdot1) + (2\cdot1) & (-1\cdot-1) + (2\cdot2) \end{pmatrix} = \]

\[\begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}\]
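To see non-commutativity concretely, multiply the same two matrices in the opposite order:

\[\begin{pmatrix} 1 & -1 \\ 1 & 2 \end{pmatrix} \times \begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix} = \begin{pmatrix} (1\cdot1)+(-1\cdot-1) & (1\cdot1)+(-1\cdot2) \\ (1\cdot1)+(2\cdot-1) & (1\cdot1)+(2\cdot2) \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ -1 & 5 \end{pmatrix}\]

a different result from \(\begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}\).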

Vectors and Matrices can be transposed:

transposition: rotate a matrix/vector so that columns turn into rows and vice versa:

\[u = \begin{pmatrix} 1 \\ 5 \\ -2 \end{pmatrix}\]

\[u^T = u' = \begin{pmatrix} 1 & 5 & -2 \end{pmatrix}\]

Matrix example: the first row becomes the first column, the second row becomes the second column, etc.

Orthogonal Projection

We can think about how much one vector \(w\) can be captured by moving along another vector \(v\):

  • If we imagine the sun shining directly down onto \(v\) (perpendicular to \(v\)), the shadow cast by \(w\) on \(v\) is the orthogonal projection of \(w\) on \(v\). (does this seem familiar??)

Dot Products

If \(u\) and \(v\) are \(n \times 1\) vectors, their inner product or dot product is \(u \bullet v = u' \times v\), where \(u'\) is the transpose of \(u\).

\[u = \begin{pmatrix} 1 \\ -2 \end{pmatrix}; \ v =\begin{pmatrix} 4 \\ 2 \end{pmatrix} \]

\[u \cdot v = \begin{pmatrix} 1 & -2 \end{pmatrix} \begin{pmatrix} 4 \\ 2 \end{pmatrix} = 0\]

Dot Products

The dot product is equal to the length of the projection of \(u\) on \(v\) multiplied by the length of \(v\)

  • if the dot product is equal to \(0\), then \(u\) and \(v\) are orthogonal or perpendicular (no projection, no “shadow”)
  • have we seen anything that looks like this dot product before?
  • matrix multiplication, covariance!
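The connection to covariance can be made explicit. If we first subtract the mean from each vector, the dot product of the centered vectors is \(n\) times the covariance:

\[(x - \bar{x}\mathbf{1}) \bullet (y - \bar{y}\mathbf{1}) = \sum\limits_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = n \cdot Cov(x,y)\]

so two centered variables are uncorrelated exactly when they are orthogonal as vectors.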

Matrix Inversion

In addition to multiplying matrices (applying a linear transformation), we can “undo” this multiplication by multiplying by the inverse of a matrix: this is like division.

  • Inverse \(A^{-1}\) of a square matrix \(A\) (e.g. \(3\times 3\); generally, \(p\times p\)) has the property:

\[A \times A^{-1} = A^{-1} \times A = I_{3 \times 3} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\]

This is an identity matrix with 1s on diagonal, 0s everywhere else. \(A_{p \times p} \times I_{p \times p} = A\)

  • identity matrix is matrix equivalent of 1: \(A \times I = A\)

Matrix Inversion

If a matrix multiplied by its inverse gives Identity matrix…

How does this relate to orthogonality?

  • Row \(i\) of the inverse is orthogonal to column \(j\) of the matrix for \(i \neq j\)
  • because the dot product of row \(i\) and column \(j\) is \(0\)
  • Multiplying a matrix by its inverse transforms vectors to be orthogonal

We keep talking about orthogonality because it is key to understanding what least squares does

Mean

Deriving the mean:

Imagine we have a variable \(Y\) that we observe as a sample of size \(n\). We can represent this variable as a vector in \(n\) dimensional space.

\[y = \begin{pmatrix} 3 \\ 5 \end{pmatrix}\]

Deriving the mean:

We want to pick one number (a scalar) \(\hat{y}\) to predict all of the values in our vector \(y\).

This is equivalent to doing this:

\[y = \begin{pmatrix}3 \\ 5 \end{pmatrix} \approx \hat{y} \begin{pmatrix}1 \\ 1 \end{pmatrix}\]

Choose \(\hat{y}\) on the blue line at point that minimizes the distance to \(y\).

Deriving the mean:

\(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\)

can be decomposed into two separate vectors: a vector containing our prediction (\(\hat{y}\)):

\(\begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix} = \hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\)

and another vector \(\mathbf{e}\), which is difference between the prediction vector and the vector of observations:

\(\mathbf{e} = \begin{pmatrix}3 \\ 5 \end{pmatrix} - \begin{pmatrix} \hat{y} \\ \hat{y} \end{pmatrix}\)

Deriving the mean:

This means our goal is to minimize the length of \(\mathbf{e}\).

How do we find the closest distance? The length of \(\mathbf{e}\) is calculated by taking:

\[len(\mathbf{e})= \sqrt{(3-\hat{y})^2 + (5 - \hat{y})^2}\]

When is the length of \(\mathbf{e}\) minimized?

  • when angle between \(\hat{y} \begin{pmatrix} 1 \\ 1 \end{pmatrix}\) and \(\mathbf{e}\) is \(90^{\circ}\).
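To see this numerically with \(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\), plug two candidate values into the length formula above:

\[\hat{y} = 4: \quad len(\mathbf{e}) = \sqrt{(3-4)^2 + (5-4)^2} = \sqrt{2} \approx 1.41\]

\[\hat{y} = 3: \quad len(\mathbf{e}) = \sqrt{(3-3)^2 + (5-3)^2} = \sqrt{4} = 2\]

Any choice other than \(\hat{y} = 4\) (which turns out to be the mean) gives a longer residual vector.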

Deriving the mean:

Deriving the mean:

We know that two vectors are orthogonal (\(\perp\)) when their dot product is \(0\), so we can create the following equality and solve for \(\hat{y}\).

\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix}3 & 5 \end{pmatrix} - \begin{pmatrix} \hat{y} & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix}3 & 5 \end{pmatrix} - \hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0\)

Deriving the mean:

\((\begin{pmatrix} 3 & 5 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) - (\hat{y} \begin{pmatrix} 1 & 1 \end{pmatrix} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix}) = 0\)

\(8 - 2\hat{y} = 0\)

\(2\hat{y} = 8\)

\(\hat{y} = 4\), which is exactly the mean of \(3\) and \(5\).

Deriving the mean:

More generally:

\(\mathbf{e} \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \begin{pmatrix} \hat{y} & \ldots & \hat{y} \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

\((\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix} - \hat{y}\begin{pmatrix} 1 & \ldots & 1 \end{pmatrix}) \bullet \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = 0\)

More generally:

\((\sum\limits_{i=1}^{n} y_i\cdot1) - \hat{y} \sum\limits_{i=1}^{n} 1 = 0\)

\(\sum\limits_{i=1}^{n} y_i = \hat{y} n\)

\(\frac{1}{n}\sum\limits_{i=1}^{n} y_i = \hat{y}\)

The Mean

What is the mean of our residuals \(\mathbf{e}\)?

  • If \(e = y - \hat{y}\), and \(\hat{y}\) is mean of \(y\), mean of \(e\) must be \(0\)
  • We choose \(e\) orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\), so their dot product is \(0\). That means sum of \(e\) must be \(0\), so mean of \(e\) must be \(0\).
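Continuing the earlier example: with \(y = \begin{pmatrix}3 \\ 5 \end{pmatrix}\) and \(\hat{y} = 4\),

\[\mathbf{e} = \begin{pmatrix} 3 \\ 5 \end{pmatrix} - \begin{pmatrix} 4 \\ 4 \end{pmatrix} = \begin{pmatrix} -1 \\ 1 \end{pmatrix}; \qquad \mathbf{e} \bullet \begin{pmatrix} 1 \\ 1 \end{pmatrix} = -1 + 1 = 0\]

so the residuals sum (and therefore average) to \(0\).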

Least Squares with Linear Algebra

Deriving Least Squares

Regression works similarly:

Rather than project the \(n \times 1\)-dimensional vector \(\mathbf{y}\) into one dimension (as we did with the mean), we project it into \(p\) (number of parameters) dimensional subspace. Hard to visualize, but we still end up minimizing the distance between our \(n\) dimensional vector \(\mathbf{\hat{y}}\) and the vector \(\mathbf{y}\).

  • If we have \(n = 3\) and a bivariate regression, we find the \(\mathbf{\hat{y}}\) in a \(2\)-dimensional subspace that is nearest \(\mathbf{y}\)

Deriving Least Squares

Given \(\mathbf{y}\), an \(n \times 1\) dimensional vector of all values \(y\) for \(n\) observations

and \(\mathbf{X}\), an \(n \times 2\) dimensional matrix (\(2\) columns, \(n\) observations). We call this the design matrix: a vector of \(1\)s (for the intercept) and a vector \(x\) for our other variable.

\(\mathbf{\hat{y}}\) is an \(n \times 1\) dimensional vector of predicted values (for the mean of \(Y\) conditional on \(X\)) computed by \(\mathbf{X\beta}\). \(\mathbf{\beta}\) is a \(p \times 1\) vector of parameters that we multiply by \(\mathbf{X}\).

We’ll assume there are only two parameters in \(\mathbf{\beta}\): \(b_0,b_1\) so that \(\hat{y_i} = b_0 + b_1 \cdot x_i\), so \(p = 2\)

Deriving Least Squares

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}\]

Deriving Least Squares

\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]

\[\widehat{y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]

\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \widehat{y} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix}\]

\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\) gives us the residuals.

Deriving Least Squares

We want to choose \(\mathbf{\beta}\) (that is, \(b_0,b_1\)) such that the distance between \(\mathbf{y}\) and \(\mathbf{\hat{y}}\) is minimized; equivalently, such that the sum of squared residuals is minimized.

Like before, the distance is minimized when the vector of residuals \(\mathbf{y} - \mathbf{\hat{y}} = \mathbf{e}\) is orthogonal or \(\perp\) to \(\mathbf{X}\)

Deriving Least Squares

\(\mathbf{X}'_{p\times n}\mathbf{e}_{n\times1} = \begin{pmatrix} 0_1 \\ \vdots \\ 0_p \end{pmatrix} = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'(\mathbf{Y} - \mathbf{\hat{Y}}) = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'(\mathbf{Y} - \mathbf{X\beta}) = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'\mathbf{Y} - \mathbf{X}'\mathbf{X{\beta}} = \mathbf{0}_{p \times 1}\)

\(\mathbf{X}'\mathbf{Y} = \mathbf{X}'\mathbf{X{\beta}}\)

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

Deriving Least Squares

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\]

This is the matrix formula for least squares regression.

If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)

If \(X\) is a column of \(1\)s and a column of \(x\)s, it is bivariate regression. (algebraic proof showing equivalence here)

We can now add \(p > 2\): more variables

Example:

n = 5
x = rep(1, 5)
y = c(1,2,1,6,3)

#(x'x)^-1 x'y
#t() gives transpose
#solve() gives inverse
#%*% is matrix/vector multiplication
solve(t(x) %*% x) %*% t(x) %*% y
##      [,1]
## [1,]  2.6
#mean of y
mean(y)
## [1] 2.6

Example:

Let’s do bivariate regression:

\(Y_i = a + b \cdot x_i\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

Take \(x = 0,2,4,6,8,10\), \(y = 0, 12, 21, 31, 40, 50\)

Solve for \(a\) and \(b\) using \(b = \frac{Cov(x,y)}{Var(x)}\) and \(a = \bar{y} - b \bar{x}\). Then solve for \(a\) and \(b\) using the matrix calculation.

x = c(0,2,4,6,8,10)
y = c(0, 12, 21, 31, 40, 50)

#Version 1
b_1 = cov(x,y)/var(x)
a_1 = mean(y) - mean(x)*b_1

#Version 2
X = cbind(1, x)
beta = solve(t(X) %*% X) %*% t(X) %*% y

#Estimates
c(a_1, b_1)
## [1] 1.095238 4.914286
beta
##       [,1]
##   1.095238
## x 4.914286

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(1\). the mean of the residuals is always zero (if we include an intercept). Because we included an intercept (\(b_0\)), and the regression line goes through the point of averages, the mean of the residuals is always 0. \(\overline{e} = 0\). This is also true of residuals of the mean.

Why?

the mean of the residuals is always zero.

We choose \(\begin{pmatrix}b_0 \\ b_1 \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).

\[\mathbf{1}'e = 0\]

And if this is true, then \(\sum e_i = 0\), so \(\frac{1}{n}\sum e_i = 0\).

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

Recall that \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e}\)

We chose \(\beta\) (\(a,b\)) such that \(X'e = 0\) so they would be orthogonal.

\(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\);

And, from above, we know that \(\overline{e}=0\);

so \(Cov(X,e) = \overline{xe}-\overline{x} \ \overline{e} = 0 - \overline{x}0 = 0\).

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

This also means that residuals \(e\) are always perfectly uncorrelated (Pearson correlation) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.

Key insights about regression

Bivariate regression is not guaranteed to uncover the true CEF if the CEF is not linear:

  • but it is the best linear approximation of CEF, minimizing the same distance metric as defines the mean.

Key insights about regression

With the mean, \(\hat{y}\) is a projection of \(n\) dimensional \(y\) onto a line.

With bivariate regression, \(\hat{y}\) is a projection of \(y\) onto a plane - which is 2d:

  • in order to choose the vector on this 2d surface of \(X\), we need to have new basis vectors that are orthogonal.
  • \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) transforms \(X\) such that each column is orthogonal to the other columns in \(X\)
  • This means \(x\) is transformed to be orthogonal to \(\mathbf{1}\), i.e. to have a mean of \(0\).
x = c(0,2,4,6,8,10)
y = c(0, 12, 21, 31, 40, 50)

X = cbind(1, x)
X_trans = solve(t(X) %*% X) %*% t(X) 

#dot product of transformed x, column of 1s
X_trans[2,] %*% X[,1]
##              [,1]
## [1,] 1.387779e-17

Multivariate Least Squares

Multivariate Least Squares:

Previously we predicted \(Y\) as a linear function of \(x\):

\[\hat{y_i} = b_0 + b_1 \cdot x_i\]

Now, we can imagine predicting \(y\) as a linear function of many variables:

\[\hat{y_i} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_k x_k\]

Multivariate Least Squares:

  • When we calculated the mean using matrix algebra, we projected the \(n\) dimensional vector \(Y\) onto a point on a one-dimensional line.
  • When we calculated the bivariate regression line, we projected the \(n\) dimensional vector \(Y\) onto a \(2\)-dimensional space (one for \(b_0\) and one for \(b_1\))
  • When we use multi-variate regression, we project the \(n\) dimensional vector \(Y\) onto a \(p\) dimensional space (one for each parameter/coefficient)

Multivariate Least Squares:

What is “projecting onto \(p\) dimensions”?

When we project into two dimensions, these dimensions are precisely like the \(x\) and \(y\) axes on a graph: perpendicular/orthogonal to each other.

In multivariate regression, we project \(y\) onto \(p\) orthogonal dimensions in \(\mathbf{X}\). (\((\mathbf{X}'\mathbf{X})^{-1}\) transforms to an orthogonal basis)

  • “under the hood”, regression creates a new version of \(\mathbf{X}\) where each column is orthogonal to the others

Mathematical Requirements:

  1. Matrix \(X\) has “full rank”
  • This means that all of the columns of \(\mathbf{X}\) are linearly independent.
    • cannot have two identical columns
    • no column can be a linear combination (a scaled sum) of other columns
  • If \(\mathbf{X}\) is not full rank, \(\mathbf{X}'\mathbf{X}\) cannot be inverted, so we cannot do least squares.
    • but we’ll see more on the intuition for this later.
  2. \(n \geq p\): we need at least as many data points as parameters in our equation
  • no longer trivial with multiple regression

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 1 & 0 & 0 & 2 \end{pmatrix}\]

Multivariate Least Squares:

Examples: Linear Dependence?

\[\begin{pmatrix} 1 & 0.50 & 0.25 & 0.25 \\ 1 & 0.25 & 0.50 & 0.25 \\ 1 & 0.25 & 0.25 & 0.50 \end{pmatrix}\]

Multivariate Least Squares:

When we include more than one variable in the equation, we cannot calculate slopes using simple algebraic expressions like \(\frac{Cov(X,Y)}{Var(X)}\).

  • Must use matrix algebra (this is why I introduced it)

We calculate least squares using same matrix equation (\((X'X)^{-1}X'Y\)) as in bivariate regression, but what is the math doing in the multivariate case?

Multivariate Least Squares:

When fitting the equation:

\(\hat{y_i} = b_0 + b_1x_i + b_2z_i\)

  1. \(b_1 = \frac{Cov(x^*, Y)}{Var(x^*)}\)

Where \(x^* = x - \hat{x}\) from the regression: \(\hat{x} = c_0 + c_1 z\).

  2. \(b_2 = \frac{Cov(z^*, Y)}{Var(z^*)}\)

Where \(z^* = z - \hat{z}\) from the regression: \(\hat{z} = d_0 + d_1 x\)

Does anything look familiar here?
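This “partialling out” can be checked numerically. A minimal R sketch with illustrative data (lm() and residuals() are base R; by the result above, the two slope calculations should agree):

```r
x = c(0, 2, 4, 6, 8, 10)
z = c(1, 0, 1, 0, 1, 0)
y = c(0, 12, 21, 31, 40, 50)

#x* = x - x-hat, where x-hat comes from regressing x on z (with intercept)
x_star = residuals(lm(x ~ z))

#slope computed from the residualized x
b_1 = cov(x_star, y) / var(x_star)

#coefficient on x from the full multivariate regression
coef(lm(y ~ x + z))["x"]
#b_1 should match this coefficient
```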

Multivariate Least Squares:

More generally:

\[\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]

\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

where \(X_k^* = X_k - \hat{X_k}\) obtained from the regression:

\(\hat{X_k} = c_0 + c_1 X_{1} + \ldots + c_{j} X_{j}\) over all \(j \neq k\)

\(X_k^*\) is the residual from regressing \(X_k\) on all other \(\mathbf{X_{j \neq k}}\)

Multivariate Least Squares:

How do we make sense of \(X_k^*\), the residual of \(X_k\) after regressing on all other \(\mathbf{X_{j \neq k}}\)?

  • It is a residual in the same way as \(e\): \(X_k^*\) is orthogonal to all other variables in \(\mathbf{X_{j \neq k}}\).
    • it is “perpendicular” to the other variables, as are axes on a graph.
    • It is perfectly uncorrelated (in the linear sense) with all other variables in the regression.

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\) (where \(X_k^*\) is the residual of \(X_k\) after regressing on all other \(\mathbf{X_{j \neq k}}\))?

  • The slope \(b_k\) is the change in \(Y\) with a one-unit change in the part of \(X_k\) that is uncorrelated with/orthogonal to the other variables in the regression.

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

  • Sometimes people say “the slope of \(X_k\) controlling for variables \(\mathbf{X_{j \neq k}}\)”.
    • is it “holding other factors constant”/ceteris paribus? Not quite.
    • better to think of it as “partialling out” the relationship with the other variables in \(\mathbf{X}\): the part of \(X_k\) that does not co-vary with the other variables
    • better to think of it as variation in \(X_k\) residual on the mean of \(X_k\) predicted by all the other variables in \(X\)
    • this residual variation has implications for how least squares weights observations

Multivariate Least Squares:

There are additional implications of defining the slope \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\):

Now we can see why columns of \(X\) must be linearly independent:

  • e.g. if \(X_1\) were linearly dependent on \(X_2\) and \(X_3\), then \(X_2\) and \(X_3\) perfectly predict \(X_1\).
  • If \(X_1\) is perfectly predicted by \(X_2\) and \(X_3\), then the residuals \(X_1^*\) will all be \(0\).
  • If \(X_1^*\) are all \(0\)s, then \(Var(X_1^*) = 0\), and \(b_1\) is undefined (division by zero).
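A quick R illustration of the full-rank requirement (illustrative data; the third column is deliberately an exact multiple of the second):

```r
x = c(1, 2, 3, 4)
X = cbind(1, x, 2 * x)  #third column = 2 times the second: linear dependence

#t(X) %*% X is singular (determinant 0), so it has no inverse:
#solve(t(X) %*% X) will throw an error rather than return a matrix
```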

Conclusion

  1. Least Squares generalizes the mean: it predicts \(y\) by choosing the closest \(\hat{y}\) in the space defined by our equation of \(x\) variables.
  2. This is orthogonal projection: residuals \(e\) are orthogonal to \(X\), \(\beta\) calculated using orthogonal transformation of \(X\) (variation in \(x\) orthogonal to other variables in \(X\)).
  3. These properties help us understand:
  • how to choose the correct equation to estimate causal estimands w/ regression
  • how regression estimates relate to causal estimands

Exercises

Exercise

set.seed(1234)
n = 1000
u = rnorm(n)
x = u + rnorm(n, sd = 0.5)
z = sqrt(u^2) + rnorm(n, sd = 0.5)
y = 0 + 1*x - z^2  +  rnorm(n)

Generate the data using code above.

Exercise

In R:

  1. Create the design matrix \(X\) to estimate \(Y_i = b_0 + b_1 x_i + b_2 z_i\)
  2. Use the matrix equation for least squares to obtain \(\beta\) (\(b_0, b_1, b_2\))
  3. Calculate \(x^*\) as \(x_i - \hat{x}_i\), where \(\hat{x}_i\) comes from regressing \(x\) on an intercept and \(z\).
  4. What is the correlation of \(x^*\) and \(z\)?
  5. Plot \(x^*\) and \(z\)
  6. Is \(x^*\) independent of \(z\)?

Derivation of Bivariate Regression

Deriving Bivariate Regression Formula

We want to choose \(a, b\) such that \(\mathbf{\hat{y}} = a + b\cdot \mathbf{x}\) has minimum distance to \(\mathbf{y}\)

Another way of thinking of this is in terms of residuals, or the difference between true and predicted values using the equation of the line. (Prediction errors)

\(\mathbf{e} = \mathbf{y} - \mathbf{\hat{y}}\)

Minimizing the distance also means minimizing the sum of squared residuals or length of vector \(\mathbf{e}\)

Minimizing the Distance

(Proof itself is not something you need to memorize)

We need to solve this equation:

\[\min_{a,b} \sum\limits_i^n (y_i - a - b x_i)^2\] Choose \(a\) and \(b\) to minimize this value, given \(x_i\) and \(y_i\)

We can do this with calculus: solve for when first derivative is \(0\) (since this means distance will be at its minimum)

Minimizing the Distance

First, taking the derivative with respect to \(a\) yields:

\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) \right] = 0\)

\(\sum\limits_i^n y_i - \sum\limits_i^n a - \sum\limits_i^n b x_i = 0\)

\(-\sum\limits_i^n a = -\sum\limits_i^n y_i + \sum\limits_i^n b x_i\)

\(\sum\limits_i^n a = \sum\limits_i^n (y_i - b x_i)\)

Minimizing the Distance

\(\sum\limits_i^n a = \sum\limits_i^n (y_i - b x_i)\)

Dividing both sides by \(n\), we get:

\(a = \bar{y} - b\bar{x}\)

Where \(\bar{y}\) is mean of \(y\) and \(\bar{x}\) is mean of \(x\).

Implication: regression line goes through the point of averages \(\bar{y} = a + b \bar{x}\)

Minimizing the Distance

Next, we take derivative with respect to \(b\):

\(-2 \left[ \sum\limits_i^n (y_i - a - b x_i) x_i\right] = 0\)

\(\sum\limits_i^n (y_i - (\bar{y} - b\bar{x}) - b x_i) x_i = 0\)

\(\sum\limits_i^n y_ix_i - \bar{y}x_i + b\bar{x}x_i - b x_ix_i= 0\)

Minimizing the Distance

\(\sum\limits_i^n (y_i - \bar{y})x_i = b\sum\limits_i^n (x_i - \bar{x})x_i\)

Dividing both sides by \(n\) gives us:

\(\frac{1}{n}\sum\limits_i^n (y_ix_i - \bar{y}x_i) = b\frac{1}{n}\sum\limits_i^n (x_i^2 - \bar{x}x_i)\)

\(\overline{yx} - \bar{y}\bar{x} = b (\overline{xx} - \bar{x}\bar{x})\)

\(Cov(y,x) = b \cdot Var(x)\)

\(\frac{Cov(y,x)}{Var(x)} = b\)

Proof LS Matrix Solution is the Same

Just what are the matrices doing?

But we also want to know more intuitively what these matrix operations are doing! It isn’t magic.

  • We will walk through what exactly the matrix calculations do for us.

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}'\mathbf{X} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\]

\[= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]

\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}'\mathbf{Y} = \begin{pmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\] \[= \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} = n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]

Inverting Matrices

How do we get \(^{-1}\)? This is inverting a matrix.

  • Inverse \(A^{-1}\) of matrix \(A\) that is square or \(p\times p\) has the property:

\[A \times A^{-1} = A^{-1} \times A = I_{p \times p} = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & \ddots & \ldots & 0 \\ 0 & \ldots & \ddots & 0 \\ 0 & \ldots & 0 & 1 \end{pmatrix}\]

This is an identity matrix with 1s on diagonal, 0s everywhere else.

Inverting Matrices

We need to get the determinant

For the sake of ease, we will show it for a scalar and for a \(2 \times 2\) matrix:

\[det(a) = a\]

\[det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - cb\]

Inverting Matrices

Then we need to get the adjoint. It is the transpose of the matrix of cofactors (don’t ask me why):

\[adj(a) = 1\]

\[adj\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]

Inverting Matrices

The inverse of \(A\) is \(adj(A)/det(A)\)

\[A^{-1} = \frac{1}{ad - cb}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]
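A quick check with concrete numbers:

\[A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}; \quad det(A) = 2\cdot1 - 1\cdot1 = 1; \quad A^{-1} = \frac{1}{1}\begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}\]

\[A A^{-1} = \begin{pmatrix} 2\cdot1+1\cdot(-1) & 2\cdot(-1)+1\cdot2 \\ 1\cdot1+1\cdot(-1) & 1\cdot(-1)+1\cdot2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\]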

Deriving Least Squares

\[A^{-1} = \frac{1}{ad - cb}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X}) = n \begin{pmatrix} 1 & \overline{x} \\ \overline{x} & \overline{x^2} \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{n}{n^2(\overline{x^2} - \overline{x}^2)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix}\]

Deriving Least Squares

We can put it together to get: \((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{{\beta}}\)

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{n \cdot Var(x)} \begin{pmatrix} \overline{x^2} & -\overline{x} \\ -\overline{x} & 1 \end{pmatrix} \cdot n \begin{pmatrix} \overline{y} \\ \overline{xy} \end{pmatrix}\]

\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x}\ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

Deriving Least Squares

The slope:

\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

\[b = \frac{\overline{xy} - \overline{x} \ \overline{y}}{Var(x)} = \frac{Cov(x,y)}{Var(x)}\]

\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]

Deriving Least Squares

The slope:

  • Expresses how much mean of \(Y\) changes for a 1-unit change in \(X\)

The Intercept:

\[\beta = \frac{1}{Var(x)} \begin{pmatrix} \overline{x^2}\overline{y} -\overline{x} \ \overline{xy} \\ \overline{xy} - \overline{x} \ \overline{y}\end{pmatrix} = \begin{pmatrix}a \\ b \end{pmatrix}\]

\[a = \frac{\overline{x^2}\overline{y} -\overline{x} \ \overline{xy}}{Var(x)} = \frac{(Var(x) + \overline{x}^2)\overline{y} - \overline{x}(Cov(x,y) + \overline{x}\overline{y})}{Var(x)}\]

\[= \frac{Var(x)\overline{y} + \overline{x}^2\overline{y} - \overline{x}^2\overline{y} - \overline{x}Cov(x,y)}{Var(x)}\]

\[= \overline{y} - \overline{x}\frac{Cov(x,y)}{Var(x)}\]

\[a = \overline{y} - \overline{x}\cdot b\]

Deriving Least Squares

The Intercept:

\[a = \overline{y} - \overline{x}\cdot b\]

Shows us that at \(\bar{x}\), the line goes through \(\bar{y}\). The regression line (of predicted values) goes through the point \((\bar{x}, \bar{y})\) or the point of averages.

Linear Algebra Practice

Examples/ Practice

\[ A = \begin{pmatrix} 1 & 5 \\ 1 & 7 \\ 1 & 8 \\ 1 & 11 \end{pmatrix}\]

Find

\[A'A = ?\]

\[AA' = ?\]

Examples/ Practice

\[ A = \begin{pmatrix} 1 & 5 \\ 1 & 7 \\ 1 & 8 \\ 1 & 11 \end{pmatrix}; B = \begin{pmatrix} 6 \\ 5 \\ 3 \\ -1 \end{pmatrix}\]

Find

\[A'B = ?\]

\[AB = ?\]

Examples/ Practice

\[u = \begin{pmatrix} 1 \\ -2 \\ 6 \end{pmatrix}; \ v =\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \]

And

\[(u - a) \cdot v = 0\]

Solve for \(a\)